\chapterbegin Chapter 7. How \TeX\ Reads\\What You Type

We observed in the previous chapter that an input manuscript is expressed
in terms of ``lines,'' but that these lines of input are essentially
independent of the lines of output that will appear on the finished pages.
Thus you can stop typing a line of input at any convenient place. A few
other related rules were also mentioned:

\medskip
\item\bull A $\langle\hbox{carriage-return}\rangle$ is like a space.

\smallskip
\item\bull Two spaces in a row count as one space.

\smallskip
\item\bull A blank line denotes the end of a paragraph.

\medskip
\noindent Strictly speaking, these rules are contradictory: A blank line
is obtained by typing $\langle\hbox{carriage-return}\rangle$ twice in a row,
and this is different from typing two spaces in a row. So now let's see what
the {\sl real\/} rules are. In this chapter and the next, we shall study
the very first stage in the transition from input to output.

\smallskip
In the first place, it's wise to have a precise idea of what your keyboard
sends to the machine. There are 128 characters that \TeX\ might encounter at
each step, in a file or in a line of text typed directly on your terminal. These
128@characters are classified into 16 categories numbered 0 to 15:
$$\halign{\hbox to\the\parindent{\hfil#\quad}&
#\hfil&\quad#\hfil\cr
\omit Category\qquad Meaning\hidewidth\cr
\noalign{\smallskip}
0&Escape character&(|\| in this manual)\cr
1&Beginning of group&(|{| in this manual)\cr
2&End of group&(|}| in this manual)\cr
3&Math shift&(|$| in this manual)\cr
4&Alignment tab&(|&| in this manual)\cr
5&End of line&(\<carriage-return> in this manual)\cr
6&Parameter&(|#| in this manual)\cr
7&Superscript&(|↑| in this manual)\cr
8&Subscript&(|_| in this manual)\cr
9&Ignored character&(\<null> in this manual)\cr
10&Space&(\vspace\ in this manual)\cr
11&Letter&(|A|, $\ldotss$, |Z| and |a|, $\ldotss$, |z|)\cr
12&Other character&(none of the above or below)\cr
13&Active character&(|@| in this manual)\cr
14&Comment character&(|%| in this manual)\cr
15&Invalid character&(\<delete> in this manual)\cr}$$
↑(escape character)
↑(begin-group character)
↑(end-group character)
↑(math mode character)
↑(alignment tab)
↑(parameter)
↑(superscript)
↑(subscript)
↑(ignored character)
↑(space)
↑(letter)
↑(other character)
↑(active character)
↑(comment character)
↑(invalid character)
It's not necessary for you to learn these code numbers; the point is only that
\TeX\ responds to 16@different types of characters. At first this manual led
you to believe that there were just two types---the escape character and the
others---and then you were told about two more types, the grouping
symbols |{| and |}|. In Chapter@6 you learned two more (|@| and |%|).
Now you know that there are really@16. This is the whole truth of the
matter; no more types remain to be revealed.  The category code for any
character can be changed at any time, but it is usually wise to stick to a
particular scheme.

The main thing to bear in mind is that each \TeX\ format reserves certain
characters for its own special purposes. For example, when you are using plain
\TeX\ format (Appendix@B\null), you need to know that the ten characters
\ttbegin
\ { } $ & # ↑ _ % @
\ttend
cannot be used in the ordinary way when you are typing;
↑(special characters)
↑(backslash)↑(left brace)↑(right brace)↑(dollar sign)↑(ampersand)
↑(hash mark)↑(caret)↑(underline)↑(percent)↑(at sign)
↑(single-character control sequences)
each of them will cause \TeX\ to do something special, as explained elsewhere
in this manual. If you really need these symbols as part of your manuscript,
plain \TeX\ makes it possible for you to type
$$\halign{\indent#\hfil&\qquad#\hfil\cr
|\$| for \$,& |\%| for \%,\cr
|\&| for \&,& |\@| for \@,\cr
|\#| for \#,& |\_| for \_;\cr}$$
the |\_| symbol is useful for {\it compound\_identifiers\/} in computer
programs. In mathematics formulas you can use |\{| and |\}| for $\{$ and
$\}$, while ↑{:rslash} produces a ↑{reverse slash}; for example,
$$\displaybox{`|$\{a \rslash b\}$|'\quad yields\quad `$\{a\rslash b\}$'.}$$
Furthermore |\↑| produces a circumflex accent (e.g., `|\↑e|' yields
`\↑e'\thinspace).

\exercise What horrible errors appear in the following sentence?
\ttbegin
Procter \& Gamble's stock climbed to \$2, a 10\% gain.
\ttend
\answer The spaces after `|\&|' and `|\%|' will disappear; one should type
\ttbegin
Procter \&\ Gamble's ... 10\%\ gain.
\ttend
(Also the facts are wrong.)

\exercise Can you imagine why the designer of plain \TeX\ decided not
to make `|\\|' the control sequence for reverse slashes?↑(backslash)
\answer Reverse slashes (backslashes) are fairly uncommon in formulas or
text, and |\\| is very easy to type; it was therefore felt best not to
reserve |\\| for such limited use. Typists can define |\\| to be whatever
they want (including |\rslash|).

\danger When \TeX\ reads a line of text from a file, or a line of text that
you entered directly on your terminal, it converts that text into a list of
``↑{tokens}.'' A token is either (a)@a single character with an attached
category code, or (b)@a control sequence. For example, if the conventions
of plain \TeX\ are in force, the text `|{\hskip 36 pt}|' is converted into
a list of eight tokens:
$$\dbox{|{|$↓1$\quad|\hskip|\quad|3|$↓{12}$\quad|6|$↓{12}$\quad
  \vspace$↓{10}$\quad|p|$↓{11}$\quad|t|$↓{11}$\quad|}|$↓{2}$\hss}$$
The subscripts here are the category codes, as listed earlier: 1 for
``beginning of group,'' 12 for ``other character,'' and so on. The
|\hskip| doesn't get a subscript, because it's a control sequence token
instead of a character token. Notice that the space after |\hskip| does
not get into the token list.
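
\danger As a further illustration, under the same assumption that plain \TeX's
conventions are in force, the text `|{\it x+1}|' would presumably be converted
into a list of six tokens:
$$\dbox{|{|$↓1$\quad|\it|\quad|x|$↓{11}$\quad|+|$↓{12}$\quad
  |1|$↓{12}$\quad|}|$↓{2}$\hss}$$
The letter |x| has category 11, while the plus sign and the digit are
``other characters'' of category 12; the space after |\it|, like the one
after |\hskip| above, does not get into the token list.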

\danger It is important to understand the idea of token lists, if you want
to gain a thorough understanding of \TeX, and it is convenient to learn
the concept by thinking of \TeX\ as if it were a living organism. The
individual lines of input in your files are seen only by \TeX's ``eyes''
and ``mouth''; but
after that text has been gobbled up, it is sent to \TeX's ``stomach'' in
the form of a token list, and the digestive processes that do the actual
typesetting are based entirely on tokens. As far as the stomach is concerned,
the input flows in as a stream of tokens, somewhat as if your \TeX\
manuscript had been typed all on one extremely long line.

\danger You should remember two chief things about \TeX's tokens: (1)@A
control sequence is considered to be a single object that is no longer
composed of a sequence of letters. Therefore long control sequence names
are no harder for \TeX\ to deal with than short ones, once they have been
converted to tokens. Furthermore, spaces are not ignored after control
sequences inside a token list; the ignore-space rule applies only in an
input file, during the time that strings of characters are being
tokenized.  (2)@Once a category code has been attached to a character
token, the attachment is permanent. For example, if character `|{|' were
suddenly declared to be of category@12 instead of category@1, the
characters `|{|$↓1$' already inside token lists of \TeX\ would still
remain of category 1; only newly-made lists would contain `|{|$↓{12}$'
tokens. In other words, individual characters receive a fixed
interpretation as soon as they have been read from a file, based on the
category they have at the time of reading. Control sequences are
different, since they can change their interpretation at any time.  \TeX's
digestive processes always know exactly what a character token signifies,
because the category code appears in the token itself; but when the
digestive processes encounter a control sequence token, they must look up
the current definition of that control sequence in order to figure out
what it means.
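
\ddanger Here is a minimal sketch of point (2), offered only as an
illustration. It uses the |\catcode| operation discussed later in this
chapter; the control sequence name |\less| is hypothetical, and we assume
that `|<|' and `|>|' start out as ordinary characters of category@12.
\ttbegin
\def\less{<}
{\catcode`\<=1 \catcode`\>=2 \hbox<\less>}
\ttend
The `|<|' stored in the definition of |\less| was read while it had
category@12, so it should remain `|<|$↓{12}$' even after the |\catcode|
changes. The newly read `|<|' and `|>|' that surround |\less|, on the other
hand, become tokens of categories 1 and@2, and they delimit the contents
of the |\hbox|.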

\dangerexercise Some of the category codes 0 to 15 will never appear as
subscripts in character tokens, because they disappear in \TeX's mouth.
For example, characters of category 0 (escapes) never get to be tokens.
Which categories can actually reach \TeX's stomach?
\answer 1, 2, 3, 4, 6, 7, 8, 10, 11, 12. Active characters (type 13)
are considered to be control sequences rather than character tokens.

\ddanger There's a program called ↑{.INITEX} that is used to install
\TeX, starting from scratch; |INITEX| is like \TeX\ except that it can
do even more things. It can compress hyphenation patterns into special
tables that facilitate rapid hyphenation, and it can
produce format files like `|plain.fmt|' from `|plain.tex|'.
But |INITEX| needs extra space to carry out such tasks, so it generally
has less memory available for typesetting than you would expect to find in a
production version of \TeX.

\ddanger When |INITEX| begins, it knows nothing
but \TeX's primitives. All 128@characters are initially of category@12,
except that ↑{<carriage-return} has category@5,
↑{<space} has category@10, ↑{<null} has category@9, ↑{<delete} has category@15,
the 52 letters |A|$\,\ldotss$|Z| and |a|$\,\ldotss$|z| have category@11,
and ↑{backslash} has category@0.
It follows that |INITEX| is initially incapable of carrying out some of
\TeX's primitives that depend on grouping; you can't use |\def| or |\hbox|
until there are characters of categories 1 and@2.
Appendix@B begins with ↑{*catcode} commands to provide characters of the
necessary categories; for example,
\ttbegin
\catcode`\{=1
\ttend
assigns category 1 to the |{| symbol. The |\catcode| operation is like
many other primitives of \TeX\ that we shall study later; by modifying
internal codes like the category codes, you can adapt \TeX\ to a wide
variety of applications.
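
\ddanger A format file might continue in the same vein. The following lines
are only a sketch based on the table of categories at the beginning of this
chapter, not a quotation of Appendix@B:
\ttbegin
\catcode`\%=14
\catcode`\}=2  % end of group
\catcode`\$=3  % math shift
\catcode`\&=4  % alignment tab
\catcode`\#=6  % parameter
\ttend
The assignment for |%| comes first and carries no comment, since under
the initial conventions of |INITEX| the |%| sign is still an ``other
character'' until that line has been processed.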

\ddangerexercise Suppose that `|\catcode`\<=1 \catcode`\>=2|' appears
near the beginning of a group that begins with `|{|'; these specifications
instruct \TeX\ to treat |<| and |>| as group delimiters. According to
\TeX's rules of locality, the characters |<| and |>| will revert to
their previous categories when the ↑{group} ends. But should the group
end with |}| or@with@|>|\thinspace?
\answer It ends with any character of category 2; then the effects of
all |\catcode| definitions within the group are wiped out, except those
that were ↑{*global}. \TeX\ doesn't have any built-in knowledge about
how to pair up particular kinds of grouping characters.

\ddanger Although control sequences are treated as single objects,
\TeX\ does provide a way to break them into lists of character tokens:
If you write ↑{*string}|\cs|,
where |\cs| is any control sequence, you get the list of characters for that
control sequence's name. For example, |\string\TeX| produces four tokens:
|\|$↓{12}$, |T|$↓{12}$, |e|$↓{12}$, |X|$↓{12}$. Each character in this token
list automatically gets category code@12 (``other''),
including the backslash that |\string| inserts to represent an escape
character.  However, category@10 will be assigned to the character `\vspace'
(blank ↑{space}) if a space character somehow sneaks into the name of a
control sequence.
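
\ddanger For instance, assuming the plain \TeX\ conventions described above,
|\string\&| should produce just two tokens, |\|$↓{12}$ and |&|$↓{12}$. The
ampersand that appears in this list is an ``other character,'' not an
alignment tab of category@4, so it no longer has any special effect.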

\ddanger Conversely, you can go from a list of character tokens to a
control sequence by saying `↑{*csname}\<tokens>↑{*endcsname}'. The tokens
that appear in this construction between |\csname| and |\endcsname| may
include other control sequences, as long as those control sequences
ultimately expand into characters instead of \TeX\ primitives; the final
characters can be of any category, not necessarily letters. For example,
`|\csname TeX\endcsname|' is essentially the same as `|\TeX|'; but
`|\csname\TeX\endcsname|' is illegal, because |\TeX| expands into tokens
containing the |\kern| primitive. Furthermore,
`|\csname\string\TeX\endcsname|' will produce the unusual control sequence
`|\\TeX|', which you can't ordinarily write.
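
\ddanger Here is a small sketch of how |\csname| can manufacture names that
contain non-letters; the name `|item1|' and its replacement text are purely
hypothetical, and |\expandafter| is assumed to behave as it does in the
exercises below.
\ttbegin
\expandafter\def\csname item1\endcsname{first}
\csname item1\endcsname
\ttend
The |\expandafter| makes \TeX\ form the control sequence from `|item1|'
before |\def| looks at it, so the first line defines that control sequence
and the second line uses it. Typing `|\item1|' directly would presumably not
work, since the digit is not a letter and would not be taken as part of
the name.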

\ddangerexercise Experiment with \TeX\ to see what |\string| does when it
is followed by an ↑{active character} like |@|. \ (Active characters behave
like control sequences, but they are not prefixed by an escape.) \ What
is an easy way to conduct such experiments online? What control sequence
could you put after |\string| to@obtain the single character
token@|\|$↓{12}$?
\answer If you type `|\message{\string@}|' and `|\message{\string\@}|', \TeX\
responds with `|@|' and `|\@|', respectively. ↑(*message)
To get |\|$↓{12}$ from |\string| you therefore need to make backslash an
active character. One way to do this is
\ttbegin
{\catcode`/=0 \catcode`\\=13 /message{/string\}}
\ttend
(The ``↑{null control sequence}'' that you get when there are no
tokens between |\csname| and |\endcsname| is not a solution to this exercise,
because |\string| converts it to `|\csname\endcsname|'.)

\ddangerexercise What tokens does
`|\expandafter\string\csname a\string\   b\endcsname|' produce?
(There are three spaces before the |b|.)
\answer |\|$↓{12}$ |a|$↓{12}$ |\|$↓{12}$ \vspace$↓{10}$ |b|$↓{12}$.

\ddangerexercise (To be worked after you have learned all about macros.) \
Define a control sequence |\appno| with three parameters such that
|\appno#1#2#3| defines control sequence |#1| to expand to a control sequence
whose name is the name of control sequence |#2| followed by the value of
the positive integer |#3| expressed in ↑{roman numerals}. For example,
suppose |\count20| equals 82; then `|\appno\a\TeX{\count20}|' should have
the same effect as `|\def\a{\TeXlxxxii}|'.↑(tricky macros)
\answer (We assume that parameter |#2| is not simply an active character.)
\ttbegin
\def\gobble#1{} % remove one token
\def\appno#1#2#3{\edef#1{\def#1{\csname
      \expandafter\gobble\string#2\number-#3\endcsname} } #1}
\ttend

\chapterend

Some bookes are to bee tasted,
others to bee swallowed,
and some few to bee chewed and disgested.
\author FRANCIS ↑{BACON}, {\sl Essayes\/} (1597) % p2 of orig edition

\bigskip

'Tis the good reader that makes the good book.
\author RALPH WALDO ↑{EMERSON}, {\sl Society \&\ Solitude\/} (1870) % Success

\eject